Load Libraries
library("tidyverse")
library("ggridges")
library("RColorBrewer")
Load Data
Loading the data from the .csv file that has been downloaded.
df <- read_csv("../_raw/project_data.csv")
Save loaded data
Saving the dataset as a .tsv file in the data directory, and reading the saved file back to make sure it works.
write_tsv(df, "../data/01_dat_load.tsv")
df <- read_tsv("../data/01_dat_load.tsv")
library("tidyverse")
Checking for missing values
In the first part of data cleaning, we focus on identifying and removing any missing values in our dataset. Missing data can significantly impact the quality of our analysis, leading to biased or inaccurate results.
df <- read_tsv("../data/01_dat_load.tsv")
df_cleaned <- df |>
drop_na()
rows_removed <- nrow(df) - nrow(df_cleaned)
cat("Rows removed because of missing values:", rows_removed, "\n")
Rows removed because of missing values: 0
From this first check we can see that the dataset didn't have any missing values. If there had been any, we would have removed the rows containing them.
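As a side note, before dropping rows it can be useful to see which columns drive the loss. A minimal sketch on toy data (the column names mirror two of the dataset's variables; the real dataset had no missing values):

```r
library(tidyverse)

# Toy data standing in for df; NA placement is invented for illustration
toy <- tibble(BMI = c(25, NA, 31), Smoker = c(0, 1, NA))

# Count missing values per column, then reshape to one row per variable
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing")
print(na_counts)
```
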
Checking for right type and values of each variable
In the second part we want to ensure that each variable in the dataset has the expected type. By doing this we can make sure that we don't have any faulty variables and that we understand our variables for further analysis.
column_types <- df_cleaned |>
summarise(across(everything(), class))
print(column_types)
# A tibble: 1 × 22
Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 numeric numeric numeric numeric numeric numeric numeric
# ℹ 15 more variables: HeartDiseaseorAttack <chr>, PhysActivity <chr>,
# Fruits <chr>, Veggies <chr>, HvyAlcoholConsump <chr>, AnyHealthcare <chr>,
# NoDocbcCost <chr>, GenHlth <chr>, MentHlth <chr>, PhysHlth <chr>,
# DiffWalk <chr>, Sex <chr>, Age <chr>, Education <chr>, Income <chr>
We can see that all of our variables are numeric.
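Had any columns come in as character (which the check above rules out here), they could be coerced back to numeric. A hedged sketch with invented values:

```r
library(tidyverse)

# Toy columns read as character, standing in for a hypothetical messy import
toy <- tibble(HighBP = c("0", "1"), BMI = c("24.5", "31.0"))

# Coerce the named columns to numeric in one pass
toy_fixed <- toy |>
  mutate(across(c(HighBP, BMI), as.numeric))
print(sapply(toy_fixed, class))
```
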
Filtering incorrect values
In the third part of cleaning we want to filter out incorrect values. Along with the dataset we also received the valid range of values for each numeric variable, so this part is all about making sure that every value falls into its correct range.
original_df <- df_cleaned
df_cleaned <- df_cleaned |>
filter(Diabetes_binary %in% c(0, 1),
HighBP %in% c(0, 1),
HighChol %in% c(0, 1),
CholCheck %in% c(0, 1),
BMI > 0,
Smoker %in% c(0, 1),
Stroke %in% c(0, 1),
HeartDiseaseorAttack %in% c(0, 1),
PhysActivity %in% c(0, 1),
Fruits %in% c(0, 1),
Veggies %in% c(0, 1),
HvyAlcoholConsump %in% c(0, 1),
AnyHealthcare %in% c(0, 1),
NoDocbcCost %in% c(0, 1),
GenHlth %in% 1:5,
MentHlth %in% 0:30,
PhysHlth %in% 0:30,
DiffWalk %in% c(0, 1),
Sex %in% c(0, 1),
Age %in% 1:13,
Education %in% 1:6,
Income %in% 1:8)
rows_removed_after_filter <- nrow(original_df) - nrow(df_cleaned)
cat("Rows removed because of incorrect values:", rows_removed_after_filter, "\n")
Rows removed because of incorrect values: 0
From this part we can see that the dataset didn't have any out-of-range values in any of the variables. If there had been any incorrect values, we would have removed the rows containing them.
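An illustrative alternative to silent filtering is to flag out-of-range rows first so they can be inspected before removal. A sketch on toy data, with ranges mirroring the codebook used in the filter above:

```r
library(tidyverse)

# Toy rows: the first is valid, the second has GenHlth outside 1:5
toy <- tibble(GenHlth = c(3, 7), Sex = c(0, 1))

# Flag rows where any checked variable falls outside its valid range
flagged <- toy |>
  mutate(out_of_range = !(GenHlth %in% 1:5) | !(Sex %in% c(0, 1)))
print(flagged$out_of_range)
```
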
Writing the file
write_tsv(df_cleaned, "../data/02_dat_clean.tsv")
library("tidyverse")
Load Data
df_cleaned <- read_tsv("../data/02_dat_clean.tsv")
Adding new variables
Smoking
Changing the Smoker variable from binary format to character: the value 0, which indicates non-smokers, is changed to "Non-Smoker", and the value 1 to "Smoker" respectively.
df_aug <- df_cleaned |>
mutate(Smoking_Status = case_when(
Smoker == 0 ~ "Non-Smoker",
Smoker == 1 ~ "Smoker"
))
Diabetes
Converting the Diabetes_binary variable from binary to character.
df_aug <-
df_aug |>
mutate(Diabetes_Status = case_when(
Diabetes_binary == 0 ~ "Non-Diabetic",
Diabetes_binary == 1 ~ "Diabetic"))
Gender
Changing the Sex variable from binary to character.
df_aug <-
df_aug |>
mutate(Sex_character = case_when(
Sex == 0 ~ "Female",
Sex == 1 ~ "Male"))
Age
The age was changed from the 13-level age category to the corresponding range values.
df_aug <-
df_aug |>
mutate(Age_Range = case_when(
Age == 1 ~ "18-24",
Age == 2 | Age == 3 ~ "25-34",
Age == 4 | Age == 5 ~ "35-44",
Age == 6 | Age == 7 | Age == 8 ~ "45-59",
Age == 9 | Age == 10 | Age == 11 ~ "60-74",
Age == 12 | Age == 13 ~ "75-"))
Income
The Income variable changed from a 1-8 scale to three classes: Poor, Average and Wealthy.
df_aug <-
df_aug |>
mutate(Income_Class = case_when(
Income <= 3 ~ "Poor",
Income > 3 & Income <= 6 ~ "Average",
Income > 6 ~ "Wealthy"))
Physical Activity
This transforms the PhysActivity variable from binary to character.
df_aug <-
df_aug |>
mutate(Physically_Active = case_when(
PhysActivity == 0 ~ "No",
PhysActivity == 1 ~ "Yes"))
Habits
One variable worth adding is Habits, a character variable that describes whether an individual's lifestyle is healthy or not. It depends on several variables, namely Smoking, Alcohol consumption, Fruit and Veggie consumption, and Physical Activity. The first two contribute -1 point each, while the latter three contribute +1 each. A high score indicates a healthy lifestyle, a medium one an average lifestyle, and a low score an unhealthy one.
df_aug <-
df_aug |>
mutate(Habit_Score = Veggies + Fruits + PhysActivity - Smoker - HvyAlcoholConsump) |>
mutate(Habits = case_when(
Habit_Score < 0 ~ "Unhealthy",
Habit_Score >= 0 & Habit_Score < 2 ~ "Average",
Habit_Score >= 2 ~ "Healthy"),
.keep = "unused")
Health risk
The health risk depends on the prevalence of a heart attack/disease, stroke, high BP and high cholesterol. This variable weights the heart disease and stroke variables more than the others.
df_aug <-
df_aug |>
mutate(Risk_Score = HighBP + HighChol + 2*Stroke + 2*HeartDiseaseorAttack) |>
mutate(Health_Risk = case_when(
Risk_Score < 2 ~ "Low Risk",
Risk_Score >= 2 & Risk_Score < 4 ~ "Medium Risk",
Risk_Score >= 4 ~ "High Risk"),
.keep = "unused")
Socio-economical Class
First we created a binary variable for educational background: 0 refers to individuals who didn't attend school or attended but didn't graduate high school, while 1 refers to those who graduated high school and possibly received higher education.
df_aug <-
df_aug |>
mutate(Education_binary = case_when(
Education < 4 ~ 0,
Education >= 4 ~ 1)) |>
mutate(SE_Score = Income + AnyHealthcare + Education_binary + PhysActivity) |>
mutate(SE_Background = case_when(
SE_Score < 6 ~ "Lower Class",
SE_Score >= 6 ~ "Higher Class"),
.keep = "unused")
Writing the file
write_tsv(df_aug, file = "../data/03_dat_aug.tsv")
library("tidyverse")
library("ggridges")
library("RColorBrewer")
Load data
df_aug <- read_tsv("../data/03_dat_aug.tsv")
Visualization
- The distribution of BMI in the different age groups, including a comparison between smokers and non-smokers.
plot_1 <-
df_aug |>
ggplot(aes(x = BMI, y = Age_Range)) +
geom_density_ridges(aes(fill = Age_Range), alpha = 0.6) +
facet_wrap(~Smoking_Status) +
labs(x = "BMI",
y = "Age Group") +
theme_minimal(base_size = 14)+
theme(legend.position = "none",
plot.background = element_rect(fill = "white", color="white")) +
scale_fill_viridis_d(option = "rocket") +
labs(title = "BMI distribution over the different age groups",
subtitle = "A comparison between smokers and non-smokers")
plot_1
Picking joint bandwidth of 0.972
Picking joint bandwidth of 1.18
As illustrated in the plot above, smoking status mostly affects the youngest age groups, namely 18-24 and 25-34, where smokers have a generally higher BMI than non-smokers. For the other age groups the distributions are almost the same, so smoking doesn't appear to be a main factor influencing BMI for individuals over 35. The general tendency, visible from the peaks, is that BMI increases with age.
- Comparison of BMI between income classes, divided into individuals with and without physical activity.
df_aug |>
ggplot(mapping = aes(x = BMI,
y = fct_relevel(Income_Class,"Poor","Average","Wealthy"))) +
geom_boxplot(aes(fill = Physically_Active),
             outlier.shape = NA, alpha = 0.6) +
xlim(0,60) +
xlab("BMI") +
ylab("Income Class") +
theme(legend.position = "none") +
scale_fill_manual(values = c("aquamarine", "blue")) +
labs(title = "Comparison of BMI between different income classes",
     subtitle = "Aquamarine: No Physical Activity, Blue: Physical Activity")
Warning: Removed 260 rows containing non-finite values (`stat_boxplot()`).
The graph displays a generally lower BMI among physically active individuals compared to inactive ones. It is also evident that BMI gradually increases from the wealthiest to the poorest income group, regardless of physical activity status.
- Distribution of habits among individuals between genders
plot_2 <-
df_aug |>
ggplot(mapping = aes(x = fct_relevel(Habits,"Unhealthy","Average","Healthy"),
fill = Sex_character)) +
geom_bar(position = position_dodge(), alpha = 0.8) +
facet_wrap(~Diabetes_Status) +
theme_minimal(base_size = 14) +
scale_fill_manual(name = "Gender", values = c("midnightblue", "mediumorchid2")) +
#scale_fill_manual(values = c("antiquewhite4", "coral2"))
#scale_fill_viridis_d() +
theme(legend.position = "right",
plot.background = element_rect(fill = "white", color="white")) +
ylab("Count") +
xlab("Habits") +
labs(title = "The distribution of habits among individuals",
subtitle = "Comparison between 2 genders")
plot_2
Among individuals with healthy habits, in both diabetic and non-diabetic cases, more women have healthier habits, whereas in the "Average" habits group the results are more mixed: while in the non-diabetic cases more women have an "Average" lifestyle, it is the opposite for their diabetic counterparts. It is also remarkable that the count of female individuals in the non-diabetic group doubled from the "Average" to the "Healthy" group, while the increase is slighter in the diabetic cases.
On the contrary, regarding the male counterparts, among the diabetic cases there is no important change between the "Average" and "Healthy" groups, while in the non-diabetic cases the "Healthy" counts were nearly 2000 higher than the "Average" ones.
Saving Plots
plot_1 |>
ggsave(filename = "04_key_plot_1.png", path = "../results/")
Saving 7 x 5 in image
Picking joint bandwidth of 0.972
Picking joint bandwidth of 1.18
plot_2 |>
ggsave(filename = "04_key_plot_2.png", path = "../results/")
Saving 7 x 5 in image
library("tidyverse")
library("broom")
Load data
df <- read_tsv("../data/03_dat_aug.tsv")
Rows: 70692 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (9): Smoking_Status, Diabetes_Status, Sex_character, Age_Range, Income_...
dbl (23): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table(df$Diabetes_binary)
0 1
35346 35346
As can be seen, the original dataset has an equal 50-50 split between respondents with no diabetes and respondents with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes: 0 for no diabetes, and 1 for prediabetes or diabetes.
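Proportions rather than raw counts make the balance explicit; a small sketch reusing the counts reported by the table above:

```r
# Class counts as reported by table(df$Diabetes_binary) above
counts <- c(`0` = 35346, `1` = 35346)

# Convert counts to proportions of the total
props <- counts / sum(counts)
print(props)
```
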
This script is the first one related to the analysis part. The focus here is on interpreting the correlation coefficients to see what interactions exist between the variables and how they affect the variable related to the diagnosis of diabetes (Diabetes_binary).
The analysis continues with the creation of a generalized linear model. The model is fitted with all available variables and the results are interpreted. The script ends with the use of the "step" method to find out which variables suit our data best, together with a comparison between models regarding AIC and the selected variables.
Correlation analysis
So, what is correlation? Correlation is a statistical measure of a relationship between two variables. Specifically, it refers to a linear relationship between two variables. The correlation coefficient is the numerical measure of a statistical correlation and indicates the strength of the relationship.
One major restriction of correlation is that it only measures relationships between numerical variables. To assess the relationship between categorical variables, performing a chi-square test would be a good option. In our case, however, we need a dataset containing only numerical variables.
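For completeness, the chi-square alternative mentioned above could look like the following sketch, on an invented 2x2 contingency table (the counts are illustrative, not taken from the dataset):

```r
# Hypothetical cross-tabulation of two categorical variables
tab <- matrix(c(120, 80, 60, 140), nrow = 2,
              dimnames = list(Habits   = c("Healthy", "Unhealthy"),
                              Diabetes = c("No", "Yes")))

# Chi-square test of independence between the two variables
res <- chisq.test(tab)
print(res$p.value)  # a small p-value suggests the variables are associated
```
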
df_numeric <- df |>
select_if(is.numeric)
There are several correlation coefficient formulas, such as the sample correlation coefficient, the population correlation coefficient or the rank correlation coefficient. The best-known is the Pearson product-moment correlation, which divides the covariance of two variables by the product of their standard deviations and can be computed for any data set with a finite covariance matrix. Here, however, we use the Spearman rank correlation, which is better suited to the many ordinal and binary variables in this dataset.
The following code computes the Spearman correlation between the numerical variables of the data set and shows the obtained values.
cor_matrix <- cor(df_numeric, method = "spearman")
print("Correlation Matrix:")
[1] "Correlation Matrix:"
print(cor_matrix)
Diabetes_binary HighBP HighChol CholCheck
Diabetes_binary 1.00000000 0.38151555 0.28921281 0.1153816171
HighBP 0.38151555 1.00000000 0.31651485 0.1032832913
HighChol 0.28921281 0.31651485 1.00000000 0.0859813978
CholCheck 0.11538162 0.10328329 0.08598140 1.0000000000
BMI 0.32835561 0.27554596 0.15833925 0.0543965663
Smoker 0.08599896 0.08743830 0.09339831 -0.0043305151
Stroke 0.12542678 0.12905987 0.09978619 0.0225293810
HeartDiseaseorAttack 0.21152340 0.21075039 0.18118664 0.0434971444
PhysActivity -0.15866560 -0.13610217 -0.09045316 -0.0082493633
Fruits -0.05407656 -0.04085216 -0.04738362 0.0173838322
Veggies -0.07929315 -0.06662374 -0.04283626 0.0003492577
HvyAlcoholConsump -0.09485314 -0.02702989 -0.02544298 -0.0271461602
AnyHealthcare 0.02319075 0.03576444 0.03153180 0.1068004249
NoDocbcCost 0.04097657 0.02651701 0.03319927 -0.0626687433
GenHlth 0.41425866 0.32403793 0.23913786 0.0602571737
MentHlth 0.05651530 0.03889704 0.07060833 -0.0169596205
PhysHlth 0.21550736 0.17710138 0.14880130 0.0326525481
DiffWalk 0.27264601 0.23478391 0.16204341 0.0444303666
Sex 0.04441286 0.04081925 0.01732446 -0.0079912226
Age 0.26161996 0.32217979 0.21797418 0.0992080572
Education -0.16992567 -0.14354264 -0.08317736 -0.0076929007
Income -0.23252895 -0.19689594 -0.11028910 0.0072711196
Education_binary -0.10231756 -0.07871780 -0.05001763 -0.0062569876
BMI Smoker Stroke
Diabetes_binary 0.328355613 0.085998964 0.125426785
HighBP 0.275545956 0.087438297 0.129059872
HighChol 0.158339249 0.093398312 0.099786191
CholCheck 0.054396566 -0.004330515 0.022529381
BMI 1.000000000 0.022429012 0.026133378
Smoker 0.022429012 1.000000000 0.064658397
Stroke 0.026133378 0.064658397 1.000000000
HeartDiseaseorAttack 0.074474520 0.124417535 0.223393786
PhysActivity -0.169264726 -0.079823258 -0.079984782
Fruits -0.091424715 -0.074810805 -0.008996297
Veggies -0.062669226 -0.029925651 -0.047601204
HvyAlcoholConsump -0.063087867 0.077835011 -0.023394552
AnyHealthcare -0.005136231 -0.012938633 0.006483591
NoDocbcCost 0.061142416 0.035798890 0.036198325
GenHlth 0.284265010 0.149959748 0.182517254
MentHlth 0.086931776 0.062012320 0.068190661
PhysHlth 0.166514023 0.101202841 0.150871563
DiffWalk 0.233130063 0.119788712 0.192265923
Sex 0.033497623 0.112125182 0.003822095
Age -0.038140924 0.098822294 0.123544192
Education -0.118629606 -0.152917687 -0.071888693
Income -0.118986237 -0.112240530 -0.134486538
Education_binary -0.045619597 -0.060857918 -0.055726301
HeartDiseaseorAttack PhysActivity Fruits
Diabetes_binary 0.21152340 -0.158665605 -0.054076556
HighBP 0.21075039 -0.136102169 -0.040852164
HighChol 0.18118664 -0.090453163 -0.047383624
CholCheck 0.04349714 -0.008249363 0.017383832
BMI 0.07447452 -0.169264726 -0.091424715
Smoker 0.12441753 -0.079823258 -0.074810805
Stroke 0.22339379 -0.079984782 -0.008996297
HeartDiseaseorAttack 1.00000000 -0.098223332 -0.019435823
PhysActivity -0.09822333 1.000000000 0.133812914
Fruits -0.01943582 0.133812914 1.000000000
Veggies -0.03631473 0.149322340 0.238605287
HvyAlcoholConsump -0.03712977 0.019110751 -0.033245681
AnyHealthcare 0.01568748 0.027088786 0.029385153
NoDocbcCost 0.03602872 -0.063301661 -0.045842586
GenHlth 0.26881454 -0.270967210 -0.099558905
MentHlth 0.05122788 -0.097845344 -0.060024855
PhysHlth 0.18496138 -0.208012836 -0.050328104
DiffWalk 0.23261057 -0.276868259 -0.050784243
Sex 0.09816137 0.051752891 -0.088723101
Age 0.22523062 -0.092685040 0.067777788
Education -0.09372717 0.193831150 0.104653290
Income -0.14892375 0.202370499 0.076731680
Education_binary -0.06337927 0.100844116 0.042187139
Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
Diabetes_binary -0.0792931456 -0.094853140 0.023190749 0.040976573
HighBP -0.0666237364 -0.027029886 0.035764442 0.026517009
HighChol -0.0428362633 -0.025442979 0.031531795 0.033199272
CholCheck 0.0003492577 -0.027146160 0.106800425 -0.062668743
BMI -0.0626692263 -0.063087867 -0.005136231 0.061142416
Smoker -0.0299256507 0.077835011 -0.012938633 0.035798890
Stroke -0.0476012036 -0.023394552 0.006483591 0.036198325
HeartDiseaseorAttack -0.0363147331 -0.037129769 0.015687481 0.036028720
PhysActivity 0.1493223396 0.019110751 0.027088786 -0.063301661
Fruits 0.2386052867 -0.033245681 0.029385153 -0.045842586
Veggies 1.0000000000 0.022090472 0.029152204 -0.037146195
HvyAlcoholConsump 0.0220904721 1.000000000 -0.013483953 0.009682557
AnyHealthcare 0.0291522037 -0.013483953 1.000000000 -0.221657573
NoDocbcCost -0.0371461952 0.009682557 -0.221657573 1.000000000
GenHlth -0.1172584145 -0.058740518 -0.034006408 0.166006719
MentHlth -0.0407673316 0.024335944 -0.046540852 0.192201880
PhysHlth -0.0609970471 -0.035878483 -0.005765732 0.168141140
DiffWalk -0.0840715623 -0.049293977 0.008113322 0.127110921
Sex -0.0526037770 0.014164398 -0.006561785 -0.048186618
Age -0.0172075629 -0.056264226 0.141017473 -0.141049733
Education 0.1535388035 0.033426171 0.100534473 -0.094148431
Income 0.1558164176 0.066585569 0.129948474 -0.195719144
Education_binary 0.0857534413 0.030315004 0.076029518 -0.071632300
GenHlth MentHlth PhysHlth DiffWalk
Diabetes_binary 0.41425866 0.05651530 0.215507356 0.272646006
HighBP 0.32403793 0.03889704 0.177101385 0.234783906
HighChol 0.23913786 0.07060833 0.148801297 0.162043410
CholCheck 0.06025717 -0.01695962 0.032652548 0.044430367
BMI 0.28426501 0.08693178 0.166514023 0.233130063
Smoker 0.14995975 0.06201232 0.101202841 0.119788712
Stroke 0.18251725 0.06819066 0.150871563 0.192265923
HeartDiseaseorAttack 0.26881454 0.05122788 0.184961384 0.232610574
PhysActivity -0.27096721 -0.09784534 -0.208012836 -0.276868259
Fruits -0.09955890 -0.06002486 -0.050328104 -0.050784243
Veggies -0.11725841 -0.04076733 -0.060997047 -0.084071562
HvyAlcoholConsump -0.05874052 0.02433594 -0.035878483 -0.049293977
AnyHealthcare -0.03400641 -0.04654085 -0.005765732 0.008113322
NoDocbcCost 0.16600672 0.19220188 0.168141140 0.127110921
GenHlth 1.00000000 0.26951341 0.524274479 0.466539876
MentHlth 0.26951341 1.00000000 0.345336929 0.224159145
PhysHlth 0.52427448 0.34533693 1.000000000 0.461953961
DiffWalk 0.46653988 0.22415915 0.461953961 1.000000000
Sex -0.01403767 -0.12812104 -0.070312080 -0.082248325
Age 0.13875512 -0.16792884 0.044261547 0.182841621
Education -0.28153156 -0.06789651 -0.137528550 -0.199614252
Income -0.38233755 -0.17188337 -0.257596607 -0.336425444
Education_binary -0.18452702 -0.05223709 -0.095200381 -0.140369342
Sex Age Education Income
Diabetes_binary 0.0444128584 0.2616199574 -0.169925666 -0.23252895
HighBP 0.0408192533 0.3221797895 -0.143542637 -0.19689594
HighChol 0.0173244557 0.2179741750 -0.083177359 -0.11028910
CholCheck -0.0079912226 0.0992080572 -0.007692901 0.00727112
BMI 0.0334976227 -0.0381409244 -0.118629606 -0.11898624
Smoker 0.1121251819 0.0988222937 -0.152917687 -0.11224053
Stroke 0.0038220952 0.1235441922 -0.071888693 -0.13448654
HeartDiseaseorAttack 0.0981613669 0.2252306162 -0.093727165 -0.14892375
PhysActivity 0.0517528912 -0.0926850399 0.193831150 0.20237050
Fruits -0.0887231011 0.0677777878 0.104653290 0.07673168
Veggies -0.0526037770 -0.0172075629 0.153538803 0.15581642
HvyAlcoholConsump 0.0141643977 -0.0562642263 0.033426171 0.06658557
AnyHealthcare -0.0065617851 0.1410174725 0.100534473 0.12994847
NoDocbcCost -0.0481866180 -0.1410497328 -0.094148431 -0.19571914
GenHlth -0.0140376652 0.1387551158 -0.281531564 -0.38233755
MentHlth -0.1281210369 -0.1679288439 -0.067896513 -0.17188337
PhysHlth -0.0703120801 0.0442615473 -0.137528550 -0.25759661
DiffWalk -0.0822483253 0.1828416211 -0.199614252 -0.33642544
Sex 1.0000000000 0.0004467795 0.046773685 0.15796710
Age 0.0004467795 1.0000000000 -0.102378905 -0.17295027
Education 0.0467736846 -0.1023789049 1.000000000 0.46334862
Income 0.1579671041 -0.1729502656 0.463348619 1.00000000
Education_binary 0.0208576036 -0.0637010995 0.473632268 0.27570314
Education_binary
Diabetes_binary -0.102317558
HighBP -0.078717804
HighChol -0.050017633
CholCheck -0.006256988
BMI -0.045619597
Smoker -0.060857918
Stroke -0.055726301
HeartDiseaseorAttack -0.063379273
PhysActivity 0.100844116
Fruits 0.042187139
Veggies 0.085753441
HvyAlcoholConsump 0.030315004
AnyHealthcare 0.076029518
NoDocbcCost -0.071632300
GenHlth -0.184527017
MentHlth -0.052237090
PhysHlth -0.095200381
DiffWalk -0.140369342
Sex 0.020857604
Age -0.063701100
Education 0.473632268
Income 0.275703138
Education_binary 1.000000000
As there are many values, a plot will make interpretation easier; the selected plot is a heatmap. But how should it be interpreted? Correlation quantifies the sign and the magnitude of the tendency between two variables.
The sign denotes the direction of the variable relationship.
-> Values above 0 show a direct or positive relationship between the variables,
-> Values below 0 show an indirect or negative relationship,
-> A null value shows that there does not exist any tendency between both variables.
The magnitude indicates the strength of the relationship. The closer the magnitude is to the extremes of the interval (1 or -1), the stronger the trend between the variables. As the correlation coefficient approaches zero, the trend weakens.
-> If the correlation value takes the value 1 or -1, we will say that the correlation is “perfect”,
-> If the correlation value takes the value 0, we will say that the variables are not correlated.
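The sign and magnitude rules above can be demonstrated on tiny invented vectors:

```r
x <- 1:10
print(cor(x,  2 * x))   # 1: perfect positive relationship
print(cor(x, -3 * x))   # -1: perfect negative relationship

set.seed(1)
print(cor(x, rnorm(10)))  # near 0: no linear tendency
```
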
Now that we know how to interpret the correlation coefficients, let's see the plot and analyze it.
ggplot(data = as.data.frame(as.table(cor_matrix)),
aes(x = Var1, y = Var2, fill = Freq)) +
geom_tile() +
scale_fill_gradient2(low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The redder the colour, the stronger the positive correlation; the bluer the colour, the stronger the negative correlation. At first glance it can be seen that the health-related variables, for example PhysHlth, GenHlth and DiffWalk, are positively related to each other, and that there are no pairs of variables whose relationship is particularly negative.
But what does it mean to be positively or negatively correlated? Let's see some examples:
A positive correlation might be the relationship between beer sales and temperature, implying that hot weather is related to an increase in cold beer consumption.
A negative correlation might be the relationship between time spent partying before an exam and a student's exam scores.
Zero correlation describes variables that have nothing in common, such as a person's height and their exam scores.
We will now check the highest values among the obtained correlations.
cor_data <- as.data.frame(as.table(cor_matrix))
strong_correlations <- cor_data |>
filter(Freq > 0.35 | Freq < -0.35)
print("Strong Correlations:")
[1] "Strong Correlations:"
print(strong_correlations)
Var1 Var2 Freq
1 Diabetes_binary Diabetes_binary 1.0000000
2 HighBP Diabetes_binary 0.3815155
3 GenHlth Diabetes_binary 0.4142587
4 Diabetes_binary HighBP 0.3815155
5 HighBP HighBP 1.0000000
6 HighChol HighChol 1.0000000
7 CholCheck CholCheck 1.0000000
8 BMI BMI 1.0000000
9 Smoker Smoker 1.0000000
10 Stroke Stroke 1.0000000
11 HeartDiseaseorAttack HeartDiseaseorAttack 1.0000000
12 PhysActivity PhysActivity 1.0000000
13 Fruits Fruits 1.0000000
14 Veggies Veggies 1.0000000
15 HvyAlcoholConsump HvyAlcoholConsump 1.0000000
16 AnyHealthcare AnyHealthcare 1.0000000
17 NoDocbcCost NoDocbcCost 1.0000000
18 Diabetes_binary GenHlth 0.4142587
19 GenHlth GenHlth 1.0000000
20 PhysHlth GenHlth 0.5242745
21 DiffWalk GenHlth 0.4665399
22 Income GenHlth -0.3823375
23 MentHlth MentHlth 1.0000000
24 GenHlth PhysHlth 0.5242745
25 PhysHlth PhysHlth 1.0000000
26 DiffWalk PhysHlth 0.4619540
27 GenHlth DiffWalk 0.4665399
28 PhysHlth DiffWalk 0.4619540
29 DiffWalk DiffWalk 1.0000000
30 Sex Sex 1.0000000
31 Age Age 1.0000000
32 Education Education 1.0000000
33 Income Education 0.4633486
34 Education_binary Education 0.4736323
35 GenHlth Income -0.3823375
36 Education Income 0.4633486
37 Income Income 1.0000000
38 Education Education_binary 0.4736323
39 Education_binary Education_binary 1.0000000
As mentioned before, most of the highest correlation values involve health-related variables. The highest positive value is achieved between PhysHlth and GenHlth: the lower the value of PhysHlth, the lower the value of GenHlth (lower values in these variables are a good sign in this dataset), which makes a lot of sense. One thing we found curious is the correlation between GenHlth and Income, which is the most negative value. Given the earlier explanation of negative correlation, this tells us that the higher the value of GenHlth (a value of 5 means poor health), the smaller the value of Income (a value of 1 indicates a low income). This could lead to the conclusion that people with more money are healthier.
Once this analysis has been carried out, we believe it is interesting to look only at the relationship between the remaining variables and the diabetes variable. We will follow exactly the same process as before.
df_numeric <- df |>
select_if(is.numeric) |>
select(Diabetes_binary, everything())
cor_with_diabetes <- df_numeric |>
cor(df_numeric$Diabetes_binary, method = "spearman")
After generating the correlation matrix only related to the Diabetes_binary variable, we will generate a heatmap to take a look at the results.
plot_3 <- ggplot(data = as.data.frame(as.table(cor_with_diabetes)),
aes(x = Var2,
y = Var1,
fill = Freq)) +
geom_tile() +
scale_fill_gradient2(low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1, 1),
space = "Lab",
name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1))
plot_3
We will print the results to gain a better understanding of them.
print(cor_with_diabetes)
[,1]
Diabetes_binary 1.00000000
HighBP 0.38151555
HighChol 0.28921281
CholCheck 0.11538162
BMI 0.32835561
Smoker 0.08599896
Stroke 0.12542678
HeartDiseaseorAttack 0.21152340
PhysActivity -0.15866560
Fruits -0.05407656
Veggies -0.07929315
HvyAlcoholConsump -0.09485314
AnyHealthcare 0.02319075
NoDocbcCost 0.04097657
GenHlth 0.41425866
MentHlth 0.05651530
PhysHlth 0.21550736
DiffWalk 0.27264601
Sex 0.04441286
Age 0.26161996
Education -0.16992567
Income -0.23252895
Education_binary -0.10231756
Overall, we reach the same conclusions as before. Health-related variables are positively correlated with the diabetes variable, which is something everybody would expect, but it was worth confirming. Age also has something to do with having diabetes, as does having difficulty walking. These are really interesting facts, but which are the top three variables correlated with diabetes?
cor_with_diabetes_data <- as.data.frame(as.table(cor_with_diabetes))
strong_correlations <- cor_with_diabetes_data |>
filter(Freq > 0.30 | Freq < -0.30)
print("Strong Correlations:")
[1] "Strong Correlations:"
print(strong_correlations)
Var1 Var2 Freq
1 Diabetes_binary A 1.0000000
2 HighBP A 0.3815155
3 BMI A 0.3283556
4 GenHlth A 0.4142587
To conclude the correlation analysis, the top three variables most correlated with diabetes are (in order): GenHlth, HighBP and BMI. From these variables we can say that:
A low value in the GenHlth variable (the lower the value, the healthier) tends to go with a low value in Diabetes_binary (0 means no diabetes). Thus, better general health could be linked to not having diabetes.
A low value in the HighBP variable (0 means no high blood pressure) tends to go with a low value in Diabetes_binary. Thus, lower blood pressure could be linked to not having diabetes.
A low value in the BMI variable (low values indicate a lower body mass) tends to go with a low value in Diabetes_binary. Thus, a lower BMI could be linked to not having diabetes.
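The same top-three ranking can be obtained programmatically by sorting absolute correlations. A sketch with the values copied from the printout above:

```r
# Correlations with Diabetes_binary, transcribed from the output above
cors <- c(GenHlth = 0.414, HighBP = 0.382, BMI = 0.328, Income = -0.233)

# Rank by absolute value so strong negative correlations count too
top3 <- sort(abs(cors), decreasing = TRUE)[1:3]
print(names(top3))
```
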
Generalized Linear Model (GLM)
In this section of the analysis we will work on two main tasks. The first one will be the creation and interpretation of a GLM with all the numerical variables available in the dataset. The second task will be related to applying some methods to improve this model and to check which are the most valuable features for the model.
Having said that, a Generalized Linear Model (GLM) is a statistical model used to analyze relationships between variables, accommodating various types of data and allowing for the prediction of one variable based on others. In our case, the data that we will use is the dataset formed by all the numerical variables. We want to predict whether a patient will have diabetes or not using all the available numerical variables. Finally, family = binomial indicates that we are fitting a model for binary or binomial response variables. This is suitable when the outcome variable is binary, meaning it has only two possible outcomes, such as success/failure, yes/no, or 0/1. In this case, our outcome variable is ‘Diabetes_binary’ which only has 0 (non-diabetic) and 1 (diabetic).
Let's create the model and interpret its results.
model_all <- glm(data = df_numeric,
Diabetes_binary ~ .,
family = binomial)
tidy_summary <- tidy(model_all)
glance_summary <- glance(model_all)
full_summary <- bind_cols(tidy_summary, glance_summary)
print(full_summary)
# A tibble: 23 × 13
term estimate std.error statistic p.value null.deviance df.null logLik
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 (Interc… -6.87 0.125 -55.1 0 98000. 70691 -36194.
2 HighBP 0.735 0.0197 37.3 9.85e-304 98000. 70691 -36194.
3 HighChol 0.587 0.0189 31.1 7.02e-213 98000. 70691 -36194.
4 CholChe… 1.36 0.0813 16.7 7.15e- 63 98000. 70691 -36194.
5 BMI 0.0756 0.00157 48.0 0 98000. 70691 -36194.
6 Smoker -0.00192 0.0189 -0.101 9.19e- 1 98000. 70691 -36194.
7 Stroke 0.162 0.0409 3.96 7.42e- 5 98000. 70691 -36194.
8 HeartDi… 0.253 0.0284 8.88 6.42e- 19 98000. 70691 -36194.
9 PhysAct… -0.0329 0.0213 -1.55 1.22e- 1 98000. 70691 -36194.
10 Fruits -0.0344 0.0196 -1.75 7.95e- 2 98000. 70691 -36194.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
The combined tidy() and glance() output above summarizes the fitted model. Let us look at the most important aspects:
p-values: Low p-values (typically below 0.05) suggest that the corresponding predictor variable is statistically significant in predicting the outcome. Most of the significant variables are the ones related to health (as mentioned in the correlation analysis). Interestingly, being a smoker or not is not significant for the diabetes outcome, whereas 'Sex' and 'Income' appear highly relevant.
Deviance and Residual Deviance: Deviance is a measure of model fit. We compare the deviance of our model to that of a null model (a model with no predictors) to assess how well our model explains the variability in the data. Lower deviance indicates a better fit, and our model's residual deviance is clearly below the null deviance.
AIC (Akaike Information Criterion): AIC is a measure of model performance that penalizes for the number of parameters in the model. Lower AIC values indicate a better trade-off between model complexity and fit.
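The AIC definition above (AIC = 2k − 2·log-likelihood, with k the number of estimated parameters) can be verified directly. This is an illustrative sketch on simulated data, not our dataset:

```r
# Illustrative check that AIC = 2k - 2*logLik for a binomial GLM
# (simulated data, not our dataset).
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))   # simulate a logistic relationship

fit <- glm(y ~ x, family = binomial)

k <- length(coef(fit))                  # number of estimated parameters
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(manual_aic, AIC(fit))         # the two values agree
```

The penalty term 2k is what makes AIC favor simpler models when two fits explain the data about equally well.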
Now that we have our model with all the variables, we will try to improve it by applying the step() method, which selects a subset of predictors to include in a model. The process adds or removes variables one at a time based on a statistical criterion (AIC, by default) to improve the model's fit or simplicity. We will use two different approaches:
forward: Starts with no predictors and adds them one at a time, selecting the one that improves the model the most at each step.
backward: Starts with all predictors and removes them one at a time, excluding the one that contributes the least at each step.
Both methods aim to find an optimal subset of predictors for a model.
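One caveat worth keeping in mind: calling step() with direction = "forward" directly on the full model gives it no candidate terms left to add, so it returns the full model unchanged. A genuine forward search starts from a null (intercept-only) model with a scope describing the candidate terms. A hedged sketch on the built-in mtcars data (our df_numeric and its formula would take the place of mtcars here):

```r
# Forward selection needs a null starting model plus a scope of candidate
# terms; otherwise there is nothing to add. Sketch on built-in mtcars data.
null_fit <- glm(am ~ 1, data = mtcars, family = binomial)
fwd <- step(null_fit,
            scope = ~ mpg + wt,    # candidate predictors to consider
            direction = "forward",
            trace = 0)             # suppress the step-by-step printout
```

Because step() only adds a term when it lowers the AIC, the selected model's AIC can never be worse than the null model's.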
forward_model <- step(model_all,
                      direction = "forward")

Start: AIC=72433.85
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Smoker +
Stroke + HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth +
MentHlth + PhysHlth + DiffWalk + Sex + Age + Education +
Income + Education_binary
backward_model <- step(model_all,
                       direction = "backward")

Start: AIC=72433.85
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Smoker +
Stroke + HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth +
MentHlth + PhysHlth + DiffWalk + Sex + Age + Education +
Income + Education_binary
Df Deviance AIC
- Smoker 1 72388 72432
- Education_binary 1 72388 72432
- NoDocbcCost 1 72388 72432
- AnyHealthcare 1 72389 72433
<none> 72388 72434
- PhysActivity 1 72390 72434
- Fruits 1 72391 72435
- Veggies 1 72395 72439
- Education 1 72398 72442
- MentHlth 1 72399 72443
- Stroke 1 72404 72448
- DiffWalk 1 72408 72452
- PhysHlth 1 72437 72481
- HeartDiseaseorAttack 1 72468 72512
- Income 1 72516 72560
- Sex 1 72583 72627
- HvyAlcoholConsump 1 72636 72680
- CholCheck 1 72726 72770
- HighChol 1 73352 73396
- HighBP 1 73764 73808
- Age 1 73979 74023
- BMI 1 74997 75041
- GenHlth 1 75131 75175
Step: AIC=72431.86
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke +
HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth +
MentHlth + PhysHlth + DiffWalk + Sex + Age + Education +
Income + Education_binary
Df Deviance AIC
- Education_binary 1 72388 72430
- NoDocbcCost 1 72388 72430
- AnyHealthcare 1 72390 72432
<none> 72388 72432
- PhysActivity 1 72390 72432
- Fruits 1 72391 72433
- Veggies 1 72395 72437
- Education 1 72398 72440
- MentHlth 1 72399 72441
- Stroke 1 72404 72446
- DiffWalk 1 72408 72450
- PhysHlth 1 72437 72479
- HeartDiseaseorAttack 1 72468 72510
- Income 1 72516 72558
- Sex 1 72586 72628
- HvyAlcoholConsump 1 72639 72681
- CholCheck 1 72726 72768
- HighChol 1 73353 73395
- HighBP 1 73764 73806
- Age 1 73983 74025
- BMI 1 75001 75043
- GenHlth 1 75133 75175
Step: AIC=72430.01
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke +
HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth +
MentHlth + PhysHlth + DiffWalk + Sex + Age + Education +
Income
Df Deviance AIC
- NoDocbcCost 1 72388 72428
- AnyHealthcare 1 72390 72430
<none> 72388 72430
- PhysActivity 1 72390 72430
- Fruits 1 72391 72431
- Veggies 1 72395 72435
- MentHlth 1 72400 72440
- Education 1 72401 72441
- Stroke 1 72404 72444
- DiffWalk 1 72408 72448
- PhysHlth 1 72437 72477
- HeartDiseaseorAttack 1 72468 72508
- Income 1 72516 72556
- Sex 1 72586 72626
- HvyAlcoholConsump 1 72639 72679
- CholCheck 1 72726 72766
- HighChol 1 73353 73393
- HighBP 1 73765 73805
- Age 1 73983 74023
- BMI 1 75004 75044
- GenHlth 1 75133 75173
Step: AIC=72428.32
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke +
HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + AnyHealthcare + GenHlth + MentHlth +
PhysHlth + DiffWalk + Sex + Age + Education + Income
Df Deviance AIC
- AnyHealthcare 1 72390 72428
<none> 72388 72428
- PhysActivity 1 72391 72429
- Fruits 1 72391 72429
- Veggies 1 72395 72433
- MentHlth 1 72400 72438
- Education 1 72401 72439
- Stroke 1 72404 72442
- DiffWalk 1 72408 72446
- PhysHlth 1 72437 72475
- HeartDiseaseorAttack 1 72469 72507
- Income 1 72520 72558
- Sex 1 72586 72624
- HvyAlcoholConsump 1 72639 72677
- CholCheck 1 72726 72764
- HighChol 1 73354 73392
- HighBP 1 73766 73804
- Age 1 74007 74045
- BMI 1 75004 75042
- GenHlth 1 75142 75180
Step: AIC=72427.78
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke +
HeartDiseaseorAttack + PhysActivity + Fruits + Veggies +
HvyAlcoholConsump + GenHlth + MentHlth + PhysHlth + DiffWalk +
Sex + Age + Education + Income
Df Deviance AIC
<none> 72390 72428
- PhysActivity 1 72392 72428
- Fruits 1 72393 72429
- Veggies 1 72397 72433
- MentHlth 1 72401 72437
- Education 1 72402 72438
- Stroke 1 72406 72442
- DiffWalk 1 72410 72446
- PhysHlth 1 72438 72474
- HeartDiseaseorAttack 1 72470 72506
- Income 1 72520 72556
- Sex 1 72587 72623
- HvyAlcoholConsump 1 72641 72677
- CholCheck 1 72734 72770
- HighChol 1 73356 73392
- HighBP 1 73768 73804
- Age 1 74053 74089
- BMI 1 75006 75042
- GenHlth 1 75142 75178
cat("\nForward Selection Model Summary:\n")
Forward Selection Model Summary:
tidy_summary_for <- tidy(forward_model)
glance_summary_for <- glance(forward_model)
full_summary_for <- bind_cols(tidy_summary_for, glance_summary_for)
print(full_summary_for)
# A tibble: 23 × 13
term estimate std.error statistic p.value null.deviance df.null logLik
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 (Interc… -6.87 0.125 -55.1 0 98000. 70691 -36194.
2 HighBP 0.735 0.0197 37.3 9.85e-304 98000. 70691 -36194.
3 HighChol 0.587 0.0189 31.1 7.02e-213 98000. 70691 -36194.
4 CholChe… 1.36 0.0813 16.7 7.15e- 63 98000. 70691 -36194.
5 BMI 0.0756 0.00157 48.0 0 98000. 70691 -36194.
6 Smoker -0.00192 0.0189 -0.101 9.19e- 1 98000. 70691 -36194.
7 Stroke 0.162 0.0409 3.96 7.42e- 5 98000. 70691 -36194.
8 HeartDi… 0.253 0.0284 8.88 6.42e- 19 98000. 70691 -36194.
9 PhysAct… -0.0329 0.0213 -1.55 1.22e- 1 98000. 70691 -36194.
10 Fruits -0.0344 0.0196 -1.75 7.95e- 2 98000. 70691 -36194.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
cat("\nBackward Selection Model Summary:\n")
Backward Selection Model Summary:
tidy_summary_back <- tidy(backward_model)
glance_summary_back <- glance(backward_model)
full_summary_back <- bind_cols(tidy_summary_back, glance_summary_back)
print(full_summary_back)
# A tibble: 19 × 13
term estimate std.error statistic p.value null.deviance df.null logLik
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 (Interc… -6.82 0.119 -57.3 0 98000. 70691 -36195.
2 HighBP 0.736 0.0197 37.3 4.53e-304 98000. 70691 -36195.
3 HighChol 0.588 0.0188 31.2 2.14e-213 98000. 70691 -36195.
4 CholChe… 1.37 0.0810 16.9 7.10e- 64 98000. 70691 -36195.
5 BMI 0.0756 0.00157 48.1 0 98000. 70691 -36195.
6 Stroke 0.162 0.0409 3.96 7.35e- 5 98000. 70691 -36195.
7 HeartDi… 0.253 0.0284 8.90 5.80e- 19 98000. 70691 -36195.
8 PhysAct… -0.0326 0.0213 -1.53 1.25e- 1 98000. 70691 -36195.
9 Fruits -0.0346 0.0196 -1.77 7.74e- 2 98000. 70691 -36195.
10 Veggies -0.0611 0.0233 -2.62 8.85e- 3 98000. 70691 -36195.
11 HvyAlco… -0.751 0.0486 -15.5 7.42e- 54 98000. 70691 -36195.
12 GenHlth 0.585 0.0114 51.1 0 98000. 70691 -36195.
13 MentHlth -0.00432 0.00128 -3.38 7.24e- 4 98000. 70691 -36195.
14 PhysHlth -0.00827 0.00119 -6.94 3.83e- 12 98000. 70691 -36195.
15 DiffWalk 0.116 0.0258 4.50 6.93e- 6 98000. 70691 -36195.
16 Sex 0.266 0.0190 14.0 1.28e- 44 98000. 70691 -36195.
17 Age 0.153 0.00383 39.8 0 98000. 70691 -36195.
18 Educati… -0.0360 0.0102 -3.54 4.05e- 4 98000. 70691 -36195.
19 Income -0.0584 0.00513 -11.4 4.68e- 30 98000. 70691 -36195.
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
After printing their summaries, we will select the model with the lowest AIC.
AIC_full <- AIC(model_all)
AIC_forward <- AIC(forward_model)
AIC_backward <- AIC(backward_model)
cat("AIC for Full Model:", AIC_full, "\n")
AIC for Full Model: 72433.85
cat("AIC for Forward Model:", AIC_forward, "\n")
AIC for Forward Model: 72433.85
cat("AIC for Backward Model:", AIC_backward, "\n")
AIC for Backward Model: 72427.78
It can be seen that the model with the lowest AIC is the one created by backward selection. The forward model has exactly the same AIC as the full model: since forward selection was started from the full model, there was nothing left to add, and it returned the full model unchanged. As we are curious about the backward model, we will check how many variables it contains and which they are.
all_variables <- names(coef(model_all))
selected_variables <- names(coef(backward_model)[coef(backward_model) != 0])
not_selected_variables <- setdiff(all_variables, selected_variables)
count_selected_variables <- length(selected_variables)
cat("Number of Variables selected by Backward Model:", count_selected_variables, "\n\n")
Number of Variables selected by Backward Model: 19
cat("Selected Variables:", paste(selected_variables, collapse = ", "), "\n\n")
Selected Variables: (Intercept), HighBP, HighChol, CholCheck, BMI, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age, Education, Income
cat("Variables in 'model_all' but not selected by Backward Model:", paste(not_selected_variables, collapse = ", "), "\n")
Variables in 'model_all' but not selected by Backward Model: Smoker, AnyHealthcare, NoDocbcCost, Education_binary
The backward model selects 19 terms out of the 23 used by the full model. Four variables are dropped from the optimal model: Smoker, AnyHealthcare, NoDocbcCost and Education_binary. This could have been anticipated: looking at the summary of the model with all the variables, these four variables have the highest p-values and are therefore the least significant predictors in the first model.
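One more interpretation aid: the GLM coefficients are on the log-odds scale, so exponentiating them yields odds ratios, which are often easier to read (for instance, HighBP's coefficient of 0.736 corresponds to an odds ratio of roughly exp(0.736) ≈ 2.09). A minimal sketch on simulated data; on our fitted models, tidy(backward_model, exponentiate = TRUE) would produce the same kind of quantity:

```r
# Coefficients of a logistic GLM are log-odds; exp() turns them into odds
# ratios. Sketch on simulated data (not our dataset).
set.seed(1)
x <- rnorm(2000)
y <- rbinom(2000, 1, plogis(0.7 * x))   # true log-odds slope of 0.7

fit <- glm(y ~ x, family = binomial)
odds_ratio <- exp(coef(fit)["x"])       # multiplicative change in odds per unit x
```

An odds ratio above 1 means the predictor raises the odds of the outcome; below 1 means it lowers them.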
Saving Plots
plot_3 |>
ggsave(filename = "05_key_plot_3.png", path = "../results/")

Saving 7 x 5 in image
library("tidyverse")
library("broom")

Load data

df <- read_tsv("../data/03_dat_aug.tsv")
Rows: 70692 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (9): Smoking_Status, Diabetes_Status, Sex_character, Age_Range, Income_...
dbl (23): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
# A tibble: 6 × 32
Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 1 0 1 26 0 0
2 0 1 1 1 26 1 1
3 0 0 0 1 26 0 0
4 0 1 1 1 28 1 0
5 0 0 0 1 29 1 0
6 0 0 0 1 18 0 0
# ℹ 25 more variables: HeartDiseaseorAttack <dbl>, PhysActivity <dbl>,
# Fruits <dbl>, Veggies <dbl>, HvyAlcoholConsump <dbl>, AnyHealthcare <dbl>,
# NoDocbcCost <dbl>, GenHlth <dbl>, MentHlth <dbl>, PhysHlth <dbl>,
# DiffWalk <dbl>, Sex <dbl>, Age <dbl>, Education <dbl>, Income <dbl>,
# Smoking_Status <chr>, Diabetes_Status <chr>, Sex_character <chr>,
# Age_Range <chr>, Income_Class <chr>, Physically_Active <chr>, Habits <chr>,
# Health_Risk <chr>, Education_binary <dbl>, SE_Background <chr>
The second analysis script will perform PCA and create models with several variations. The first model will use the components needed to reach 80% of the variability after the PCA analysis, combined with logistic regression for prediction. The second and third will be separate GLMs for men and women, which will allow us to analyze whether gender is especially important when predicting patients with diabetes.
PCA
It has to be mentioned that we have used the code provided by the lecturers (PCA tidyverse) and applied it to our own data to reach our conclusions. Although the code is similar, the conclusions achieved are completely our own. We want to highlight that this code helped us understand how Principal Component Analysis (PCA) works and how to interpret the obtained results. Let's start with PCA.
PCA is a dimensionality reduction method that is often used on large data sets, transforming a large set of variables into a smaller one that still contains most of the information in the original set. PCA can be used for many purposes; we have opted to apply it to create models with fewer variables and to draw more meaningful conclusions from them. Thus, we are using it to reduce the dimensionality of our data set with many variables, making it more manageable for analysis.
Having said that, the first thing to do is to work only with numerical variables. PCA is typically applied to continuous variables and assumes that the variables are on a numerical scale: it is based on the covariance matrix, which involves computing variances and covariances between numeric variables. Therefore, directly applying PCA to a dataset with categorical variables is not appropriate. However, there are techniques that extend PCA to handle categorical data. One such technique is Multiple Correspondence Analysis (MCA), an extension of PCA suitable for datasets where variables are categorical and can take more than two levels. As we are not working with categorical data here, we will not pursue this idea.
column_classes <- df |>
map_chr(class)
numeric_columns <- names(which(column_classes == "numeric")) |>
  print()

 [1] "Diabetes_binary"      "HighBP"               "HighChol"
[4] "CholCheck" "BMI" "Smoker"
[7] "Stroke" "HeartDiseaseorAttack" "PhysActivity"
[10] "Fruits" "Veggies" "HvyAlcoholConsump"
[13] "AnyHealthcare" "NoDocbcCost" "GenHlth"
[16] "MentHlth" "PhysHlth" "DiffWalk"
[19] "Sex" "Age" "Education"
[22] "Income" "Education_binary"
After checking which are the numerical variables of our data set, we will select only those to perform the analysis. Before using PCA, we have to make sure that all our data is playing on a level playing field: scaling means adjusting the values so they are all in a similar range.
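To see what the scaling step actually does, note that prcomp(x, scale = TRUE) standardizes each column (mean 0, standard deviation 1) before the decomposition, which is equivalent to applying scale() first. A small sketch on the built-in mtcars data:

```r
# Sketch: prcomp(x, scale = TRUE) is equivalent to standardizing the
# columns with scale() before running PCA. Shown on built-in mtcars data.
p1 <- prcomp(mtcars, scale = TRUE)
p2 <- prcomp(scale(mtcars))
all.equal(p1$sdev, p2$sdev)   # identical component standard deviations
```

Without scaling, variables measured on large scales (like BMI) would dominate the components over binary 0/1 indicators.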
pca_fit <- df |>
  select(where(is.numeric)) |>
  prcomp(scale = TRUE)

Now, we will create a special kind of map for our data using PCA. To do this, we blend the PCA results with the original data, adding back the information we temporarily set aside. It is like bringing back the colors to our points based on categories that were there in the first place but were temporarily taken away for PCA. We use the augment() function from the "broom" package to make this happen; it needs the fitted model and the original data as inputs.
df_augmented <- augment(pca_fit, df)
ggplot(df_augmented, aes(.fittedPC1, .fittedPC2, color = Diabetes_Status)) +
  geom_point(size = 3, alpha = 0.7, shape = 16) +
  scale_color_manual(values = c("Non-Diabetic" = "#FF5733", "Diabetic" = "#33FF57")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "lightgray"))